perf(shuffles): Incrementally retrieve metadata in reduce #3545
Conversation
CodSpeed Performance Report
Merging #3545 will degrade performance by 13.84%.
Summary
Benchmarks breakdown
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #3545 +/- ##
==========================================
- Coverage 78.06% 77.17% -0.90%
==========================================
Files 728 728
Lines 89967 91917 +1950
==========================================
+ Hits 70236 70936 +700
- Misses 19731 20981 +1250
Interesting. One suggestion I have is to maybe see if we can use

I wonder if this might simplify the logic here, and it might also fix some interesting OOM issues I was observing, wrt the workers that are holding the metadata objectrefs dying due to OOM. Could we try that and see if it gives us the same effect? We have to figure out where it is appropriate to call this on the metadata objectrefs, though. I would look into
Another idea... Looks like these metadata objects are being retrieved using a Ray remote function:

My guess is that if the

I wonder if there's a better way of dealing with this.

Edit: I think this might only be triggered from certain codepaths, and in most cases the metadata is returned as an objectref after execution. This code is so messy...
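For context on the pattern being discussed (the actual snippet isn't preserved above), here is a minimal, hypothetical sketch of what retrieving partition metadata through a Ray remote function can look like. `get_metadata` and the list-based "partition" are made up for illustration and are not Daft's real code.

```python
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
def get_metadata(partition):
    # Compute lightweight stats from a materialized partition (stand-in for
    # Daft's real partition metadata, e.g. row count and byte size).
    return {"num_rows": len(partition)}

# The partition lives in the object store; fetching its metadata this way means
# launching an extra remote task and blocking on ray.get for the result.
partition_ref = ray.put(list(range(1_000)))
metadata = ray.get(get_metadata.remote(partition_ref))
print(metadata)  # {'num_rows': 1000}
```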
Doing some experiments in #3557, but it seems like

Doing another round of tests now using an explicit
So this PR looks good to me, but I wonder if we should take a more holistic approach here and maybe just fix how metadata is being passed around the system rather than patch it here specifically for the map/reduce shuffle.
Maybe something really dumb like hooking into a stats/metadata actor? Would love to get thoughts from @samster25
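To make the "stats/metadata actor" idea concrete, here is a hedged sketch of what such a central cache could look like; `MetadataStore` and its methods are hypothetical and not part of Daft or Ray.

```python
import ray

@ray.remote
class MetadataStore:
    """Hypothetical central actor: tasks push partition metadata here on
    completion, and downstream stages read cached stats instead of pulling
    metadata objectrefs off worker heaps."""

    def __init__(self):
        self._metadata = {}

    def put(self, partition_id, metadata):
        self._metadata[partition_id] = metadata

    def get(self, partition_id):
        return self._metadata.get(partition_id)

    def total_rows(self):
        return sum(m.get("num_rows", 0) for m in self._metadata.values())

# Usage sketch:
# store = MetadataStore.remote()
# ray.get(store.put.remote("part-0", {"num_rows": 1_000}))
# print(ray.get(store.total_rows.remote()))
```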
Tried this out: essentially, always fetch metadata and cache it upon partition task completion, unless it is the final partition task. Tried this on a few shuffle configurations (1000 x 1000, 2000 x 2000), and it works pretty well, roughly 20s faster. This could be viable? I believe that as long as the task is not final, the partition metadata will need to be retrieved anyway for subsequent tasks, so we might as well retrieve it once it is complete. Since it is a blocking call, we want to pipeline it with running tasks, which should work if we retrieve it during the task awaiting state.
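A rough sketch of what "retrieve metadata during the task awaiting state" means, assuming a hypothetical driver-side loop; none of these names (`await_tasks`, `get_metadata`, `inflight`) come from the actual scheduler.

```python
import ray

@ray.remote
def get_metadata(partition):
    return {"num_rows": len(partition)}

def await_tasks(inflight, metadata_cache, final_task_ids):
    """inflight maps a partition ObjectRef to its task id. As each task
    completes, eagerly fetch and cache its metadata (unless it is a final
    task), so the blocking fetch overlaps with tasks still running on the
    cluster."""
    pending = list(inflight)
    while pending:
        done, pending = ray.wait(pending, num_returns=1)
        for partition_ref in done:
            task_id = inflight[partition_ref]
            if task_id not in final_task_ids:
                metadata_cache[task_id] = ray.get(get_metadata.remote(partition_ref))
            yield task_id, partition_ref
```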
Made a refactor to have the metadata caching logic on the scheduler instead of on the reduce op, PTAL!
Tested on TPCH SF 1000 on an 8 x i8.4xlarge node cluster.
Not much difference.
The idea sounds good to me, but wondering if we should just make it always retrieve metadata -- does that affect performance?
@@ -1771,7 +1771,7 @@ def __iter__(self) -> MaterializedPhysicalPlan:
         try:
             step = next(self.child_plan)
             if isinstance(step, PartitionTaskBuilder):
-                step = step.finalize_partition_task_single_output(stage_id=stage_id)
+                step = step.finalize_partition_task_single_output(stage_id=stage_id, cache_metadata_on_done=False)
How cheap is this, and should we just always do this perhaps?
I believe the metadata is also going to be fetched if a user ever tries to display/show a dataframe, since it needs to figure out how large the entire dataframe is using the total row count. If that's the case, maybe we just simplify the logic here and always force a fetch?
I think doing this on the final step may introduce some regressions if the result set of a .collect is very large and metadata is not required, i.e. if the dataframe is not displayed, or perhaps it is an intermediate collect step. If it is a show then it should be quite cheap, and this eager fetching probably won't help very much.
Incrementally retrieve partitions and metadata as fanouts are completed, instead of retrieving only after all are completed. This drastically speeds up large partition shuffles.
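As a hedged sketch of the before/after behavior (hypothetical names; `fanout_task` and `get_metadata` are stand-ins, not the actual shuffle code): previously the reduce side effectively blocked until every fanout finished before pulling metadata; now it pulls each fanout's metadata as soon as that fanout completes, overlapping the fetches with the remaining map work.

```python
import ray

@ray.remote
def fanout_task(i):
    # Stand-in for a map-side fanout producing one partition.
    return list(range(i * 10, (i + 1) * 10))

@ray.remote
def get_metadata(partition):
    return {"num_rows": len(partition)}

fanout_refs = [fanout_task.remote(i) for i in range(8)]

# Before (all-at-once): block until every fanout is done, then fetch metadata.
# ray.wait(fanout_refs, num_returns=len(fanout_refs))
# metadata = {r: ray.get(get_metadata.remote(r)) for r in fanout_refs}

# After (incremental): fetch each fanout's metadata as soon as it completes.
metadata = {}
pending = list(fanout_refs)
while pending:
    done, pending = ray.wait(pending, num_returns=1)
    for ref in done:
        metadata[ref] = ray.get(get_metadata.remote(ref))

print(f"retrieved metadata for {len(metadata)} fanouts incrementally")
```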
Before, we would see pauses between the map and reduce phases. (The example below is a 1000 x 1000 x 100MB shuffle.)
Now: